Corpus-Based Adaptation Mechanisms for Chinese Homophone Disambiguation

Author

  • Chao-Huang Chang
Abstract

Based on the concepts of bidirectional conversion and automatic evaluation, we propose two user-adaptation mechanisms, character-preference learning and pseudo-word learning, for resolving Chinese homophone ambiguities in syllable-to-character conversion. The 1991 United Daily corpus of approximately 10 million Chinese characters is used for extraction of 10 reporter-specific article databases and for computation of word frequencies and character bigrams. Experiments show that ~0.5 percent (testing sets) to 71.8 percent (training sets) of conversion errors can be eliminated through the proposed mechanisms. These concepts are thus very useful in applications such as Chinese input methods and speech recognition systems.

1 Introduction

Corpus-based Chinese NLP research has been very active in recent years as more and more computer-readable Chinese corpora have become available. Reported corpus-based NLP applications [10] include machine translation, word segmentation, character recognition, text classification, lexicography, and spelling checking. In this paper, we describe our work on adaptive Chinese homophone disambiguation (also known as phonetic-input-to-character conversion or phonetic decoding) using part of the 1991 United Daily (UD) corpus of approximately 10 million Chinese characters (Hanzi).

Inputting Chinese characters into a computer requires a coding method, structural or phonetic, since there are more than 10,000 of them in common use. In the literature [3, 7], there are several hundred different coding methods for this purpose. For most users, phonetic coding (Pinyin or Bopomofo) is the choice. To input a Chinese character, the user simply keys in its corresponding phonetic code. This is easy to learn, but suffers from the homophone problem, i.e., a phonetic code corresponding to several different characters. Therefore, the user needs to choose the desired character from a (usually long) list of candidate characters, which is inefficient and annoying. So, automatic homophone disambiguation is highly desirable.

Several disambiguation approaches have been reported in the literature [3, 7]. Some of them have even been realized in commercial input methods, e.g., Hanin, WangXing, Going. However, the accuracies of these disambiguators are not satisfactory. In this paper, we propose a corpus-based adaptation method for improving the accuracy of homophone disambiguation.

For homophone disambiguation, what we need as input is syllable (phonetic code) corpora instead of text corpora. For adaptation, what we need is personal corpora instead of general corpora (such as the UD corpus). Thus, we first design a selection procedure to extract articles by individual reporters. Ten personal corpora were set up in this way. An additional domain-specific corpus, translated AP news, was built up similarly. Then, we design a highly reliable (99.7% correct) character-to-syllable converter [1] to transfer the text corpora into syllable corpora.

Our baseline disambiguator is rather conventional, composed of a word-lattice searching module, a path scorer, and a lexicon-driven word hypothesizer. Using the original text corpora and the corresponding syllable corpora, we propose a user-adaptation method, applying the concepts of bidirectional conversion [1] and automatic evaluation [2]. The adaptation method includes two parts: character-preference learning and pseudo-word learning. Given a personal corpus (i.e., sample text), the adaptation pro-
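
The three baseline components named above (a lexicon-driven word hypothesizer, a word lattice, and a path scorer) can be illustrated with a rough sketch. This is not the author's implementation: the toy lexicon, the frequency counts, the Pinyin spellings, and the function name `decode` are all hypothetical. Scoring a path by summed log unigram word frequencies is an assumption made for illustration, since the excerpt does not say how the path scorer combines word frequencies and character bigrams.

```python
import math

# Hypothetical toy lexicon: syllable sequence -> candidate (word, frequency) pairs.
# The entries and counts are illustrative only; they are not from the UD corpus.
LEXICON = {
    ("shi",): [("是", 500), ("事", 120), ("市", 80)],
    ("bei",): [("北", 60), ("被", 90)],
    ("jing",): [("京", 30), ("經", 70)],
    ("bei", "jing"): [("北京", 200), ("背景", 150)],
    ("tai",): [("台", 50), ("太", 110)],
    ("tai", "bei"): [("台北", 180)],
    ("tai", "bei", "shi"): [("台北市", 120)],
}

TOTAL = sum(freq for entries in LEXICON.values() for _, freq in entries)
MAX_WORD_LEN = max(len(key) for key in LEXICON)


def decode(syllables):
    """Return the best-scoring character string for a syllable list, or None.

    best[i] holds (log score, word list) for the best path covering
    syllables[:i]; this is a dynamic-programming search over the word lattice.
    """
    n = len(syllables)
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            if best[start] is None:
                continue  # no path reaches this lattice node
            key = tuple(syllables[start:end])
            # Word hypothesizer: look up every lexicon word matching the span.
            for word, freq in LEXICON.get(key, []):
                # Path scorer (assumed): sum of log relative word frequencies.
                score = best[start][0] + math.log(freq / TOTAL)
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    if best[n] is None:
        return None
    return "".join(best[n][1])


print(decode(["bei", "jing"]))
print(decode(["tai", "bei", "shi"]))
```

In this toy setup the homophones 北京 and 背景 share the syllables `bei jing`, and the frequency counts alone decide between them. A user whose texts favor 背景 would get the wrong character string every time, which is the kind of error user adaptation is meant to reduce. How the proposed mechanisms adjust such scores is not shown in this excerpt.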

Similar resources

An Adaptive Learning Algorithm for Task Adaptation in Chinese Homophone Disambiguation

Task adaptation from a set of run-time feedback information has become increasingly crucial for corpus-based natural language applications owing to the variant run-time environment. An order-based adaptive learning algorithm is proposed in this paper for task adaptation to best-fit the run-time environment in the application of Chinese homophone disambiguation. It shows which objects to be adju...

An Adaptive Algorithm for Learning Changes in Run-Time Context Domain

The run-time context domain has much effect on the performance of practical corpus-based applications. Previous smoothing techniques, and class-based and similarity-based models cannot handle the dynamic status perfectly. In this paper, an adaptive learning algorithm is proposed for task adaptation to fit best the run-time context domain in the application of Chinese homophone disambiguation. I...

Applying Repair Processing in Chinese Homophone Disambiguation

Repair processing plays an important role in spoken language processing systems. This paper proposes a method for correcting Chinese repetition repairs and demonstrates the effects of repair processing in Chinese homophone disambiguation. The experimental results show that the precision rate of 93.87% and the recall rate of 90.65% can be achieved for the repair processing. At the same time, 50%...

Large Span statistical language models: application to homophone disambiguation for large vocabulary speech recognition in French

Homophone words is one of the specific problems of Automatic Speech Recognition (ASR) in French. Moreover, this phenomenon is particularly high for some inflections like the singular/plural inflection (72% of the 40.7K lemma of our 240K word dictionary have inflected forms which are homophonic). In order to take into account word dependencies spanning over a variable number of words, it is inter...

Uyghur-Chinese Translation Disambiguation Method Research Based on Knowledge Automatic-Acquisition

This thesis studies the disambiguation method in Uyghur-Chinese translation, and proposes the design philosophy of automatic-acquisition in translation label library aiming at the deficiency of disambiguation corpus in Uyghur. It refers to the existing Uyghur-Chinese bilingual dictionary, Chinese corpus and the Internet, and acquires the corresponding Chinese translation label examples to Uyghu...

Publication date: 1993